ABSTRACT

Bias due to imperfect shear calibration is the biggest obstacle when constraints on cosmological
parameters are to be extracted from large area weak lensing surveys such as Pan-STARRS-3π,
DES or future satellite missions like Euclid.

We demonstrate that bias present in existing shear measurement pipelines (e.g. KSB) can be
almost entirely removed by means of neural networks. In this way, bias correction can depend
on the properties of the individual galaxy instead of being a single global value. We present a
procedure to train neural networks for shear estimation and apply this to subsets of simulated
GREAT08 RealNoise data.

We also show that circularization of the PSF before measuring the shear reduces the scatter 
related to the PSF anisotropy correction and thus leads to improved measurements, particularly 
on low and medium signal-to-noise data. 

Our results are competitive with the best performers in the GREAT08 competition, especially
for the medium and higher signal-to-noise sets. Expressed in terms of the quality parameter
defined by GREAT08 we achieve Q ≈ 40, 140 and 1300 without and 50, 200 and 1300 with
circularization for low, medium and high signal-to-noise data sets, respectively.

Subject headings: Cosmology: observations — gravitational lensing — methods: data analysis — surveys 



1. Introduction 

Weak gravitational lensing has proven to be a versatile method for measuring the mass
distribution of galaxy clusters. With the detection of cosmic shear it has turned into an
important tool for providing constraints on cosmological parameters such as $\sigma_8$ and
$\Omega_m$ (see Fu et al. (2008)).

Common to all applications of weak lensing 
studies is the requirement for statistical analy- 
sis of a great number of objects. Single sheared 
galaxies, due to their unknown intrinsic ellipticity 
and further observational uncertainties, can only 
give a reasonable shear estimate as part of a large 
sample. The accuracy of shear estimation methods poses a bottleneck to observational
cosmology. Especially as surveys are getting larger (Pan-STARRS-3π, DES, Euclid), shear
calibration bias could annihilate the gain from improved statistics for larger galaxy samples.

For this reason several competitions for the calibration of shear measurement pipelines and
the development of improved methods using simulated data have been hosted. The elimination
of biases among the many methods presented there has, however, only been partially successful.

In this paper we make use of artificial neural networks in order to improve the existing
shear measurement pipeline introduced by Kaiser et al. (1995) [KSB] and further developed by
Luppino & Kaiser (1997) and Hoekstra et al. (1998). We apply these to the data simulated
by GREAT08 (Bridle et al. 2010) and show that existing biases can be almost entirely removed.

2. Motivation

The fundamentals of the KSB method are to 
measure galaxy shapes derived from second mo- 
ments Qij integrated within a circular Gaussian 
weight function. From these, polarizations can be 
defined as 

$$e_1 = \frac{Q_{11} - Q_{22}}{Q_{11} + Q_{22}}, \qquad e_2 = \frac{2 Q_{12}}{Q_{11} + Q_{22}}. \qquad (1)$$

Observed polarizations $e^{obs}$ must be corrected for PSF anisotropy $p$ and their
responsivity to shear $g$, based on PSF size and the galaxy's shape. KSB achieves this by
linear corrections, such that

$$e^{obs} = e^{true} + P^{sm} p + P^{\gamma} g, \qquad (2)$$

where $P^{sm}$ is the galaxy's smear polarizability tensor and $P^{\gamma}$ is calculated as

$$P^{\gamma} = P^{sh} - P^{sm} \left(P^{sm*}\right)^{-1} P^{sh*} \qquad (3)$$

from smear polarizability tensors $P^{sm}$ and shear polarizability tensors $P^{sh}$ measured
on the galaxy and the PSF (denoted with a star). These tensors are weighted fourth order
moments of the respective light distributions.

Assuming that galaxies show no intrinsic alignment on the sky, one can thus obtain a shear
estimate as

$$\langle g \rangle = \left\langle \left(P^{\gamma}\right)^{-1} \left(e^{obs} - P^{sm} p\right) \right\rangle. \qquad (4)$$



There exist a large number of implementations of KSB, differing from each other in subtle
choices of the method for source extraction, determination of the radius for the weight
function $r_g$, $P^{\gamma}$ tensor inversion, weighting, cuts (e.g., eliminating objects with
large ellipticities or small values of $P^{\gamma}$), correction factors and further details
(Heymans et al. 2006). It is not clear a priori, and will likely depend on the particular data
set, which is the right choice on each of these points. Different decisions imply different
biases (see STEP2, Massey et al. (2007)), which have to be taken into account.

The bias introduced by not having calibrated a 
method correctly can only be avoided by carefully 
simulating the process for the very data that is to
be analyzed. This means that data with known
shear have to be simulated. It must be the aim 
of these simulations to reproduce the properties of 
the respective sample of galaxies as accurately as 
possible (e.g., in terms of intrinsic light distribu- 
tion, PSF, errors introduced by the data reduction 
and noise properties) in order for the calibrations 
to be appropriate. 

In some cases a shear calibration bias has been found using simulations and corrected
manually. For instance, T. Schrabback and McInnes et al. (2009) multiply their shear estimates
by 1/0.91 and 1/0.82, respectively, with factors found by calibration on STEP1 and STEP2
data. After such manual corrections, however, it is very likely that there is a residual bias,
not only because the corrections done are usually very simple but also because bias likely
differs for different galaxy properties and data sets.

We propose to make the best use of the simulations, which are required anyway, by letting
neural networks analyze the data and estimate shear after training them on simulated data
with known shear. Using a pipeline's ellipticity estimates and further parameters that might
be indicative of the bias present on the respective galaxy, we test how well neural networks
are able to eliminate biases and improve the shear estimate. Provided that such a scheme is
successful, correct simulations allow for optimal calibration of shear measurement pipelines.

We point out that it is not the subject of this 
work to use neural networks on the pixelated light 
distribution of the galaxies itself, but to start from 
data which are quite close to an exact shear mea- 
surement already. The advantage of this method is that the network is fed with the most
relevant parameters for shape estimates on a catalog basis, keeping training and application
comparatively inexpensive computationally.

3. Neural networks 

Neural networks are nowadays commonly used in astronomy, be it for the detection and
classification of objects or for finding photometric redshift estimates (Collister & Lahav
2004). The flavor of networks most frequently used, and also employed in this paper, is the
multi-layer Perceptron. From inputs $x_i^{in}$ fed to an input layer of neurons, parameters
are transferred through a number of hidden layers to finally make for one or more network
outputs $x_i^{out}$ (see Figure 1).




Fig. 1. Sketch of a Perceptron with two hidden layers.


A neuron $i$'s output $x_i$ depends on the incoming signal from connected nodes $j$ in the
previous layer, weighted by connection weights $w_{ji}$ and transformed by the neuron's
activation function $f_i$,

$$x_i = f_i\Big(\sum_j w_{ji} x_j\Big). \qquad (5)$$



While in general $f_i$ could be chosen differently for each node, it is usually taken to be
the same nonlinear function for all hidden nodes and the identity for the input and output
layers' nodes. In our case we use a sigmoidal function $f_i(a) = f(a) = (e^{-a} + 1)^{-1}$,
where node $i$ is a hidden node. The weights $w_{ij}$ of the connections between two adjacent
layers' nodes $i$ and $j$, and from an additional bias node which accounts for an individual
node threshold, are to be optimized such that a cost function $E$ of the network output
becomes minimal. A typical choice for the cost function is the sum over squared errors of
outputs $x_k^{out}$ on training sets $k$, for which true answers $x_k^{t}$ are known. An
additional term quadratic in the weights is added for regularization, penalizing large weights
which characterize overfitting to specific data. Thus $E$ becomes

$$E = \sum_k \left(x_k^{out} - x_k^{t}\right)^2 + \alpha \sum_{i,j} w_{ij}^2. \qquad (6)$$
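As an illustration of the forward propagation and the regularized sum-of-squares cost, the following is a minimal NumPy sketch. The function names, the weight layout (one matrix per layer with an extra row for the bias node) and the choice to include the bias weights in the penalty are assumptions made for this example, not details taken from the code used in this work.

```python
import numpy as np

def sigmoid(a):
    """Hidden-node activation f(a) = 1 / (exp(-a) + 1)."""
    return 1.0 / (np.exp(-a) + 1.0)

def forward(x_in, weights):
    """Propagate inputs through a multi-layer Perceptron.

    x_in    : array of shape (n_samples, n_inputs)
    weights : list of arrays; weights[l] has shape (n_prev + 1, n_next),
              where the extra last row holds the bias-node weights
              (layout assumed for this sketch).
    Hidden layers use the sigmoid, the output layer the identity.
    """
    x = np.asarray(x_in, dtype=float)
    for layer, w in enumerate(weights):
        ones = np.ones((x.shape[0], 1))        # constant bias node
        a = np.hstack([x, ones]) @ w           # weighted incoming signal
        is_output = layer == len(weights) - 1
        x = a if is_output else sigmoid(a)
    return x

def cost(x_out, x_true, weights, alpha):
    """Sum of squared errors plus a quadratic weight penalty."""
    sq_err = np.sum((x_out - x_true) ** 2)
    penalty = alpha * sum(np.sum(w ** 2) for w in weights)
    return sq_err + penalty
```

With all weights set to zero, every hidden node outputs 0.5 and the network output is zero, which gives a quick sanity check of the layout.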



For training such a network, true answers for each training set have to be known, such that
error back-propagation (Rumelhart et al. 1986) can be used for optimizing the weights^2.

However, for weak lensing measurements we can only expect the network to return a true shear
component $g_i =: x^{t}$ on average for a large sample $l$ of galaxies. Training with true
shear as the expected network output for every single galaxy is counterproductive. We would
rather like the networks to minimize the squared error between $g_i$ and
$\langle e_k \rangle_{k \in l} =: \langle x_k^{out} \rangle_{k \in l}$, i.e. the average
output of galaxies $k$ on a sample of galaxies $l$, for each shear/ellipticity component. We
can express this with a cost function of the form

$$E = \sum_l \left(\langle x_k^{out} \rangle_{k \in l} - x_l^{t}\right)^2 + \alpha \sum_{i,j} w_{ij}^2. \qquad (7)$$

The back-propagation algorithm must in this case be adapted accordingly. We start from code
provided by Collister & Lahav (2004) and implement the algorithm as described in the appendix.
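The difference between a per-galaxy cost and a cost on the set-averaged output can be made concrete with a short sketch. The function below is a hypothetical illustration (its name and argument layout are assumptions): it averages the network output over all galaxies of each set before comparing with that set's true shear component, and assumes that every set contains at least one galaxy.

```python
import numpy as np

def set_averaged_cost(x_out, set_index, g_true, weights, alpha):
    """Cost on set-averaged outputs plus quadratic weight penalty.

    x_out     : (n_galaxies,) network outputs for one shear component
    set_index : (n_galaxies,) integer label of the set each galaxy belongs to
    g_true    : (n_sets,) true shear component of each set
    weights   : list of weight arrays of the network
    alpha     : regularization strength
    """
    n_sets = len(g_true)
    sums = np.bincount(set_index, weights=x_out, minlength=n_sets)
    counts = np.bincount(set_index, minlength=n_sets)
    mean_out = sums / counts                      # average output per set
    sq_err = np.sum((mean_out - np.asarray(g_true)) ** 2)
    penalty = alpha * sum(np.sum(w ** 2) for w in weights)
    return sq_err + penalty
```

In this form, individual galaxies may scatter widely around the true shear without being penalized, as long as the mean over their set is correct.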



4. Application to GREAT08 data 

For training and testing our network with the scheme described in more detail in the
appendix, we use simulated galaxy images with known shear from the "RealNoise Blind" data
sets of GREAT08 (Bridle et al. 2009).



4.1. Sample selection 

Table 1 gives an overview of the six samples analyzed in our work. Each of the samples
contains sets of six image files of 10,000 simulated galaxies, each set being sheared with a
true shear $(g_1, g_2) \in [-0.05, 0.05]^2$ as plotted in Figure A1 of Bridle et al. (2010).

2 see appendix for a description of the algorithm, section A.1

From the 2700 sets of galaxies in the "RealNoise Blind" sample of GREAT08, we pick those
1500 sets from the fiducial (medium) signal-to-noise group which share the same PSF (the
fiducial PSF, also labeled PSF 1 by GREAT08) but differ in terms of galaxy size and type. In
the following analysis we denote these as sample 1.

In order to find how well such corrections can 
work on galaxies with different signal-to-noise lev- 
els, we pick two more samples. As these are homo- 
geneous in all galaxy property distributions, they 
are not as realistic as sample 1 but still can give 
an indication of the dependence of network perfor- 
mance on signal-to-noise ratio. Therefore, we per- 
form a similar analysis with the 300 high signal to 
noise data sets (sample 2) and the 300 low signal 
to noise sets (sample 3), which all share the same 
galaxy properties and PSF. It should be noted, 
though, that in the latter case signal is so low that 
without an input catalog source extraction suffers 
a significant rate of false detections. On real sin- 
gle frame data with similar signal to noise ratio, 
the centroid position on stacked frames could be 
used to eliminate false detections and have well- 
defined centroid positions, improving the accuracy 
of the measurement. In our case we cross-correlate 
catalogs against the GREAT08 grid positions to 
achieve complete and clean detections. 

4.1.1. Circularized samples

In order to study the influence of PSF cir- 
cularization as an alternative method of PSF 
anisotropy correction, we repeat the analysis with 
three more samples. 

A sample of all galaxies with medium signal-to-noise level is denoted as sample 1c. Unlike
sample 1, it contains data with all three PSFs used in the GREAT08 challenge. For this sample
we circularize all three PSFs to the same circular target PSF before running KSB. We are
using the method described by Alard & Lupton (1998) in its implementation by Gössl &
Riffeser (2002). The target PSF is similar to the initial ones but slightly larger and
circular, with a Moffat profile of 3 pixels FWHM and $\beta = 3.2$. After having built a model
of the initial PSF using 100 stars each, we convolve the data with a kernel model consisting
of a superposition of four Gaussians with $\sigma = 1, 3, 9, 0.1$ multiplied with polynomials
in $x$ and $y$ of order $n = 6, 4, 2, 0$, respectively. We do a $\chi^2$ fit of the 50 model
parameters to reach our target PSF by convolution with this kernel.
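The number of 50 model parameters follows from counting the polynomial terms: a two-dimensional polynomial of order $n$ has $(n+1)(n+2)/2$ monomials, so orders 6, 4, 2 and 0 contribute 28 + 15 + 6 + 1 = 50 coefficients. The sketch below counts them and evaluates the corresponding basis functions on a pixel grid. It assumes the usual Gaussian-times-monomial form of such kernel bases and that the widths are given in pixels; the function names are hypothetical.

```python
import numpy as np

SIGMAS = [1.0, 3.0, 9.0, 0.1]   # Gaussian widths quoted in the text (pixels assumed)
ORDERS = [6, 4, 2, 0]           # polynomial order paired with each Gaussian

def n_poly_terms(order):
    """Number of monomials x^a y^b with a + b <= order."""
    return (order + 1) * (order + 2) // 2

def kernel_basis(half_size):
    """Evaluate all basis functions on a (2*half_size + 1)^2 pixel grid."""
    grid = np.mgrid[-half_size:half_size + 1, -half_size:half_size + 1]
    y, x = grid.astype(float)
    basis = []
    for sigma, order in zip(SIGMAS, ORDERS):
        gauss = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
        for a in range(order + 1):
            for b in range(order + 1 - a):
                basis.append(gauss * x**a * y**b)
    return np.array(basis)

# One free coefficient per basis function.
n_params = sum(n_poly_terms(n) for n in ORDERS)
```

A kernel is then a linear combination of these basis images, which is what makes the $\chi^2$ fit for the coefficients a linear problem.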

Due to the larger size of sample 1c, we reserve a larger number of galaxies for blind
testing the networks later on. We prepare two more similarly circularized samples 2c and 3c
using the data from samples 2 and 3. These data sets are homogeneous in all galaxy properties
and therefore not as realistic as sample 1c.

4.2. Running KSB 

On the 2700 sets of 10000 galaxies each from the GREAT08 "RealNoise Blind" challenge we run
a KSB implementation KSBs, based on the version assembled by T. Schrabback and denoted as
TS in Massey et al. (2007) (see also Erben et al. (2001)).

After source extraction with SExtractor the pipeline uses analyse to calculate for each
galaxy the tensors $P^{sm}$ and $P^{sh}$. PSF anisotropy correction is done with these, and
$P^{\gamma}$ is computed as

$$P^{\gamma} = P^{sh} - \frac{\mathrm{tr}(P^{sh*})}{\mathrm{tr}(P^{sm*})} P^{sm}, \qquad (8)$$

stars denoting quantities measured on the PSF.

This tensor is applied to the measured polarizations using trace inversion to find individual
galaxy shear estimates

$$e^{iso} = \frac{2 \left(e^{obs} - P^{sm} p\right)}{\mathrm{tr}(P^{\gamma})}. \qquad (9)$$


Objects with $\mathrm{tr}(P^{\gamma}) < 0.1$ are discarded. The estimate labeled as KSBs in
the following analysis is always $e^{iso}/0.91$, scaled with a calibration factor as optimized
for this implementation of KSB using STEP1 data.
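The trace-inversion estimate, the cut on the trace and the calibration factor can be summarized in a few lines. This is a minimal sketch under assumed array layouts, with a hypothetical function name; it is not the pipeline code itself.

```python
import numpy as np

def ksb_shear(e_obs, p_sm, p, p_gamma, min_trace=0.1, calibration=0.91):
    """Trace-inversion KSB shear estimate with a global calibration factor.

    e_obs   : (n, 2) observed polarizations
    p_sm    : (n, 2, 2) smear polarizability tensors of the galaxies
    p       : (2,) PSF anisotropy
    p_gamma : (n, 2, 2) P^gamma tensors
    Returns the calibrated estimates of the kept objects and the boolean
    mask marking which objects passed the trace cut.
    """
    e_obs = np.asarray(e_obs, dtype=float)
    # Anisotropy-corrected polarization e_obs - P^sm p for every galaxy.
    e_aniso = e_obs - np.einsum("nij,j->ni", p_sm, p)
    trace = np.trace(p_gamma, axis1=1, axis2=2)
    keep = trace >= min_trace                 # discard objects with tr < 0.1
    e_iso = 2.0 * e_aniso[keep] / trace[keep, None]
    return e_iso / calibration, keep
```

Returning the mask alongside the estimates keeps the link to the input catalog, which is needed when further per-galaxy quantities are passed on to a network.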

4.3. Neural network training 

From the output of the KSBs pipeline we take $e^{iso}$ as the starting point for neural
network analysis. As potential predictors for bias we add the weight function radius $r_g$,
which in our pipeline is equal to SExtractor's FLUX_RADIUS, the flux as measured by analyse,
all four components of $P^{\gamma}$ and the pipeline's error estimates for the initial shape
measurements $\Delta e$.

3 cf. eqn. 2 to 4

Table 1

Properties of the six samples used for neural network analysis, see section 4.1

sample | galaxy sets (a) | s/n (b) | PSF (c) | galaxy properties (d) | gradient sets (e) | validation sets (e) | blind sets (f)
1      | 1500            | 20      | 1       | mixed                 | 1394              | 53                  | 53
1c (g) | 2100            | 20      | 1/2/3   | mixed                 | 867               | 33                  | 1200 (h)
2      | 300             | 40      | 1       | fiducial              | 260               | 10                  | 30
2c (g) | 300             | 40      | 1       | fiducial              | 260               | 10                  | 30
3      | 300             | 10      | 1       | fiducial              | 260               | 10                  | 30
3c (g) | 300             | 10      | 1       | fiducial              | 260               | 10                  | 30

a each galaxy set contains 10000 galaxies

b as defined by Bridle et al. (2010), the s/n estimate from the KSBs pipeline is considerably lower

c using the PSF name convention as in Bridle et al. (2009)

d GREAT08 simulate galaxy sets with different galaxy sizes and galaxy sets featuring either concentric bulge and
disc, off-center bulge and disc or only one of either bulge or disc models; the fiducial group corresponds to medium
size and concentric bulge and disc; as real data will typically contain a diversity of galaxy properties, samples 1
and 1c come closer to realistic samples

e the sets used for training are split up randomly for each of the 500 differently initialized and trained networks
into sets used for finding the gradient and sets used for validating the solution during the training process

f these are galaxy sets put aside and not used for training, validating or selecting the networks

g c stands for circularization; we have circularized all three anisotropic PSFs to the same circular PSF in this
sample

h a large number of blind sets have been reserved for extensive blind testing on this sample



The networks used feature three hidden layers of ten nodes each for both components and are
trained using the algorithm described^4 and the true shears published by GREAT08 after the
end of the challenge. We split each sample into a subset used for training and a subset for
later blind testing of network performance. Following the optimized ratio of gradient to
validation sets derived by Amari et al. (1995) for the asymptotic case of many available
sets, we split the training sets again into subsets used for determining the gradient and
others required for validation during training. The respective sizes of subsets are given in
Table 1. For each sample the training process is iterated 500 times with different random
initial weight configurations and a random allocation of training sets into gradient and
validation sets.

4.4. Selecting and blind testing networks 

For evaluating the networks, and for any real-world application, it is necessary to ensure
that the trained networks perform consistently well both on the data used for training and on
similar data not used for training or selecting the networks, in our case the blind sets (cf.
Table 1).

Overtrained networks are generally characterized by some weights becoming comparatively
large. While the penalization of large weights by the second term in eqn. 7 already reduces
overfitting to the training sample, overtraining may still occur where the number of training
sets is sufficiently small, because the reduction in errors from overfitting outweighs the
penalization due to large weights. For all following analyses, we therefore use the sum of
squared weights,

$$S = \sum_{i,j} w_{ij}^2, \qquad (10)$$

to discard networks which are more than $1\sigma$ above the average in $S$ for the sample of
networks cropped at 1.2 times the median $S$. This deselects about half of the networks on
each of the samples, some of which might in fact not be overtrained. In the presence of larger
samples, therefore, when more careful selection of networks is possible, performance might
still increase. For all following analyses, we only use the weight-selected networks not
discarded by these criteria.

4 see appendix, section A.2
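One possible reading of this selection rule is sketched below: the mean and standard deviation of $S$ are computed only from networks with $S$ at most 1.2 times the median, and every network exceeding that mean by more than one standard deviation is discarded. The function name and this exact interpretation of the cropping are assumptions for illustration.

```python
import numpy as np

def select_networks(sum_sq_weights, crop_factor=1.2, n_sigma=1.0):
    """Return a boolean mask of networks kept by the weight criterion.

    sum_sq_weights : (n_networks,) sum of squared weights S per network
    crop_factor    : networks above crop_factor * median(S) are excluded
                     when estimating the mean and scatter of S
    n_sigma        : networks more than n_sigma standard deviations above
                     the cropped mean are discarded
    """
    s = np.asarray(sum_sq_weights, dtype=float)
    cropped = s[s <= crop_factor * np.median(s)]
    threshold = cropped.mean() + n_sigma * cropped.std()
    return s <= threshold
```

Cropping before computing the statistics keeps a few strongly overtrained networks with very large $S$ from inflating the threshold.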

To compare the performance of the networks left on training and blind data, we plot the
root mean square error of the shear $g_i^o$ measured against the true shear $g_i^t$,

$$\mathrm{rms}_i = \sqrt{\left\langle \left(g_i^o - g_i^t\right)^2 \right\rangle}, \qquad (11)$$

which the weight-selected networks achieve on the data used for training and the blind sets.
Figure 2 shows the result for sample 1c, component 1, for which the blind rms and training
rms are equal within statistical uncertainty. For the other samples and components, due to
the smaller number of blind sets, scatter is significantly larger and small constant offsets
from the identity in both directions occur, likely due to the particular properties of the
blind sets. In all cases, the networks performing best on the training data perform
consistently well on blind data.



Fig. 2. Comparison of the performance (in terms of rms) on training data and blind data of
the weight-selected networks trained on sample 1c, component 1. The dashed line is the
identity.

We select the best networks on each sample and component simply by taking the network with
the smallest squared errors on the training data, not taking into account their performance
on the blind data. After having discarded overfitted networks according to their weights as
described above, this results in networks performing consistently well on training data and
blind data. We perform the following analyses on the blind data sets only. As the results
found on training and blind data agree within the statistical uncertainty, we use the complete
sample of data sets for the analysis done in Section 4.7, as this is necessary here. For
sample 1c, also the analysis in Section 4.7 is done exclusively on the blind sample.

4.5. Shear measurement performance 

We analyze the performance of plain KSBs and the neural networks selected in the previous
section. For each sample and component, we calculate both for the blind and the training sets
a quality parameter

$$Q_1 = \frac{10^{-4}}{\langle \mathrm{rms}_i^2 \rangle}, \qquad (12)$$

averaging the rms (cf. eqn. 11) over all galaxy sets within each sample. Resulting values of
$Q_1$ are shown in Tables 2 and 3. Note that $Q_1$ is smaller than the GREAT08 quality
parameter $Q =: Q_6$ because we do not average the residuals over similar sets^5 until
section 4.7.
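The relation between the per-set quality parameter and the GREAT08-style one can be sketched as follows. The function name is hypothetical, and the sketch assumes that the per-set shear estimates are ordered such that consecutive groups of `group_size` sets share the same true shear and observing conditions.

```python
import numpy as np

def quality(g_obs, g_true, group_size=1):
    """Quality parameter 1e-4 / <residual^2> for one shear component.

    g_obs, g_true : (n_sets,) measured and true shear per galaxy set
    group_size    : number of consecutive, similar sets whose residuals
                    are averaged before squaring (1 for Q_1, 6 for Q_6)
    """
    residual = np.asarray(g_obs, dtype=float) - np.asarray(g_true, dtype=float)
    # Average residuals within each group first, then square and average.
    grouped = residual.reshape(-1, group_size).mean(axis=1)
    return 1e-4 / np.mean(grouped ** 2)
```

Averaging within groups before squaring lets uncorrelated scatter cancel while a common bias survives, which is why the grouped quantity rewards low bias.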

Results of the circularized samples 1c to 3c are generally better in comparison to similar
sets with anisotropic PSFs which have to be corrected for by a $P^{sm} p$ term^6. A more
detailed discussion of the advantages of circularization combined with bias correction is
given in section 4.8.



4.6. Bias analysis I: linear bias 

We ca lculate additive and mu ltipl icative biases, 



follow ing iHevmans et al.l (|2006) and lMassev et al 
(|2007l ). We apply a linear fit of residual shears 
g° — g\ against true shears g\, i.e. 



9i ~ 9 l 



9i 



(13) 
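Such a fit takes only a few lines with NumPy. The sketch below uses an unweighted least-squares fit and a hypothetical function name; it does not reproduce the uncertainty estimates quoted in the tables.

```python
import numpy as np

def fit_linear_bias(g_obs, g_true):
    """Fit residual = m * g_true + c for one shear component.

    g_obs, g_true : (n_sets,) measured and true shear per galaxy set
    Returns the multiplicative bias m and the additive bias c.
    """
    g_obs = np.asarray(g_obs, dtype=float)
    g_true = np.asarray(g_true, dtype=float)
    residual = g_obs - g_true
    # np.polyfit returns coefficients from highest degree down: slope, intercept.
    m, c = np.polyfit(g_true, residual, deg=1)
    return m, c
```

For example, estimates generated as `0.97 * g_true + 2e-4` return m = -0.03 and c = 2e-4 up to rounding.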



Results for the six samples are shown in Tables 2 and 3 and Figure 3. For neural network
analysis, both multiplicative and additive bias are well within the range of the most
successful methods in the GREAT08 competition (Bridle et al. 2010). The requirements for
future surveys as computed by Amara & Refregier (2008) are always fulfilled for medium and
high signal-to-noise in terms of $c_i^2 < 10^{-7}$. For the multiplicative bias criterion,
$m_i < 10^{-3}$, the network estimate is successful at least within an order of magnitude.
The higher the signal-to-noise ratio, the better multiplicative bias can also be corrected,
while especially for smaller signal there seems to be a tendency of $m < 0$ for the neural
network estimate, potentially due to the weaker shear signal.

5 That is, sets that have the same true shear, PSF, signal-to-noise and galaxy properties.
6 cf. eqn. 4

A modified plot of $c_i$ and $m_i$ at the three different signal-to-noise levels with all
methods participating in GREAT08^7 and including neural network blind estimates both with and
without circularization is shown in Figure 4.

The fact that the $m$ and $c$ found for blind data are consistent with the $m$ and $c$ found
on training data is additional evidence that the networks we use are not overfitted to the
training data. In 22 of the 24 combinations of sample, component and bias parameter, the
linear biases agree within $1\sigma$; in the other two cases they agree within $2\sigma$ of
the bias measurement uncertainty.

4.7. Bias analysis II: 'six-pack' effect 

The bias of a method likely differs depending on properties such as signal-to-noise ratio,
PSF, galaxy size and profiles. Therefore in the case of inhomogeneous samples like samples 1
and 1c, the $c_i$ and $m_i$ we find for the whole sample are not necessarily equal to the true
additive and multiplicative biases of each homogeneous subsample. For this reason, a method
simply calibrated for $c_i = m_i = 0$ by an affine transformation for the whole inhomogeneous
sample is potentially still biased on each of the homogeneous subsamples and consequently for
any real-world application. Thus in order to correctly analyze the bias by means of fitting
multiplicative and additive biases, each sample would have to be split into homogeneous
subsamples first. We develop another scheme of analyzing bias here which can be applied to
homogeneous and inhomogeneous samples alike.

The GREAT08 data sets which we used for training the networks are made up of files
containing 10,000 galaxies each, six of which again share the exact same shear values and
observing conditions in terms of PSF, signal-to-noise ratio and galaxy properties. This is a
very favorable setting for bias analysis, as the composition of errors from bias and scatter
can be estimated by comparing the accuracy of shear estimates on single sets and on six times
larger overall sets.

7 cf. Bridle et al. (2010) for explanations of the methods' acronyms

When measuring the shear of a very large homogeneous set $j$ of $n \to \infty$ galaxies with
true shear $(g_1^j, g_2^j)$, the only residual for component $i$ will be the bias $b_i^j$. For
a linear bias $m_i$ and $c_i$, we would find $b_i^j = m_i \cdot g_i^j + c_i$. In the limit of
infinitely large sets the squared error of the shear estimate $e_i^j$ will be

$$Q^{-1} \propto \left(e_i^j - g_i^j\right)^2 = \left(b_i^j\right)^2. \qquad (14)$$

For smaller $n$, the scatter $\sigma_j$ of the individual galaxy measurement in that set will
add to the errors, leading to

$$Q^{-1} \propto \left\langle \left(e_i^j - g_i^j\right)^2 \right\rangle = \left(b_i^j\right)^2 + \frac{\sigma_j^2}{n}. \qquad (15)$$

This bias and the scatter will of course depend on the properties of the particular set $j$.
For the following analysis, we are interested only in a decomposition of our total mean
squared error into bias and scatter. We will thus calculate
$\langle (e_i^j - g_i^j)^2 \rangle_j$ at two different sample sizes $n = (10{,}000, 60{,}000)$
to find the root mean square of the bias $b_i := \sqrt{\langle (b_i^j)^2 \rangle_j}$ and the
scatter $\sigma := \sqrt{\langle \sigma_j^2 \rangle_j}$.
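Given the mean squared error at two sample sizes, the decomposition amounts to solving two linear equations for $b^2$ and $\sigma^2$. The sketch below starts from the two quality parameters and assumes the normalization $Q = 10^{-4} / \langle (e - g)^2 \rangle$; the function name is hypothetical.

```python
import math

def decompose(q_small, q_large, n_small=10_000, n_large=60_000):
    """Split the mean squared error into rms bias b and rms scatter sigma.

    Solves  1e-4 / Q_n = b^2 + sigma^2 / n  for the two sample sizes.
    q_small, q_large : quality parameters at n_small and n_large galaxies
    Returns (b, sigma); negative squares from noise are clipped to zero.
    """
    err_small = 1e-4 / q_small
    err_large = 1e-4 / q_large
    sigma_sq = (err_small - err_large) / (1.0 / n_small - 1.0 / n_large)
    bias_sq = err_large - sigma_sq / n_large
    return math.sqrt(max(bias_sq, 0.0)), math.sqrt(max(sigma_sq, 0.0))
```

Using the values listed in Table 2 for KSBs on sample 1, component 1 (7.4 at 10,000 galaxies and 10.0 at 60,000 galaxies), this returns a bias of roughly $3.0 \times 10^{-3}$ and a scatter of roughly 0.21, close to the tabulated 3.06 and 0.20; the small differences are consistent with the rounding of the quoted quality parameters.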

We perform the analysis on the six samples. We calculate $Q_6$, similar to equation 12 yet
averaging residuals over the six sets before taking the square. This is what the leaderboard
for the GREAT08 challenge used for ranking the submissions. Note that for pure scatter we
would find $Q_6/Q_1 = 6$, whereas for pure bias $Q_6/Q_1 = 1$. This favors methods with low
bias which perform consistently well for different conditions and galaxy parameters. For
large future surveys it is of particular importance to achieve a bias which is low even
compared to the smaller scatter of the large sample sizes (i.e. $Q_n/Q_1 \approx n$),
regardless of the individual galaxy properties or observing conditions. Except for sample 1c,
where the number of blind sets available is large enough and completeness of the sets of
60000 galaxies has been conserved when separating the blind sets, we have to use the complete
sample of galaxy sets (i.e. both the sets used for training and for blind testing) for
calculating $Q_6$. Where both $Q_1$ and the linear bias are consistent between the training
and blind sets, this is

Table 2

Performance and bias of different shear estimation methods on non-circularized samples

method       | sample (a) | comp. | c (b), blind (e)  | c (b), all (f)        | m (b), blind (e) | m (b), all (f)         | Q_1 (c), blind (e) | Q_1 (c), all/train (g) | Q_6 (c)    | σ (d) | b × 10^3 (d)
KSBs         | 1          | 1     | ...               | (2.95 ± 0.06) × 10^-3 | ...              | (-3.3 ± 0.2) × 10^-2   | ...                | 7.4 ± 0.3              | 10.0 ± 0.9 | 0.20  | 3.06
KSBs         | 1          | 2     | ...               | (5 ± 6) × 10^-5       | ...              | (-1.4 ± 0.3) × 10^-2   | ...                | 21.5 ± 0.8             | 81 ± 7     | 0.20  | 0.74
KSBs aff (h) | 1          | 1     | ...               | (0 ± 6) × 10^-5       | ...              | (0 ± 2) × 10^-3        | ...                | 18.0 ± 0.7             | 47 ± 4     | 0.20  | 1.20
KSBs aff (h) | 1          | 2     | ...               | (0 ± 6) × 10^-5       | ...              | (0 ± 3) × 10^-3        | ...                | 21.5 ± 0.8             | 82 ± 7     | 0.20  | 0.73
KSBs+NN (i)  | 1          | 1     | (0 ± 2) × 10^-4   | (5 ± 5) × 10^-5       | (-2 ± 1) × 10^-2 | (-7 ± 2) × 10^-3       | 30 ± 6             | 26.2 ± 1.0             | 133 ± 12   | 0.19  | 0.37
KSBs+NN (i)  | 1          | 2     | (0 ± 2) × 10^-4   | (-1 ± 5) × 10^-5      | (-2 ± 2) × 10^-2 | (-1.2 ± 0.3) × 10^-2   | 31 ± 6             | 27.1 ± 1.0             | 141 ± 13   | 0.19  | 0.35
KSBs         | 2          | 1     | ...               | (3.46 ± 0.04) × 10^-3 | ...              | (2.6 ± 0.2) × 10^-2    | ...                | 7.3 ± 0.6              | 8 ± 2      | 0.07  | 3.64
KSBs         | 2          | 2     | ...               | (-3.0 ± 0.4) × 10^-4  | ...              | (3.8 ± 0.2) × 10^-2    | ...                | 95 ± 8                 | 160 ± 30   | 0.07  | 0.74
KSBs aff (h) | 2          | 1     | ...               | (0 ± 4) × 10^-5       | ...              | (0 ± 2) × 10^-3        | ...                | 187 ± 15               | 1000 ± 200 | 0.07  | 0.13
KSBs aff (h) | 2          | 2     | ...               | (0 ± 4) × 10^-5       | ...              | (0 ± 2) × 10^-3        | ...                | 210 ± 20               | 1300 ± 300 | 0.07  | 0.00
KSBs+NN (i)  | 2          | 1     | (0 ± 1.5) × 10^-4 | (2 ± 4) × 10^-5       | (2 ± 5) × 10^-3  | (-1 ± 2) × 10^-3       | 200 ± 60           | 220 ± 20               | 1000 ± 200 | 0.07  | 0.16
KSBs+NN (i)  | 2          | 2     | (0 ± 2) × 10^-4   | (0 ± 4) × 10^-5       | (-3 ± 9) × 10^-3 | (-2 ± 2) × 10^-3       | 160 ± 50           | 260 ± 20               | 1900 ± 400 | 0.06  | 0.00
KSBs         | 3          | 1     | ...               | (8 ± 2) × 10^-4       | ...              | (-1.83 ± 0.07) × 10^-1 | ...                | 3.6 ± 0.3              | 5.0 ± 1.0  | 0.30  | 4.32
KSBs         | 3          | 2     | ...               | (5 ± 2) × 10^-4       | ...              | (-1.49 ± 0.10) × 10^-1 | ...                | 5.4 ± 0.4              | 10 ± 2     | 0.32  | 2.87
KSBs aff (h) | 3          | 1     | ...               | (0 ± 2) × 10^-4       | ...              | (0 ± 7) × 10^-3        | ...                | 7.2 ± 0.6              | 45 ± 9     | 0.37  | 0.00
KSBs aff (h) | 3          | 2     | ...               | (0 ± 2) × 10^-4       | ...              | (0 ± 1) × 10^-2        | ...                | 6.9 ± 0.6              | 36 ± 7     | 0.38  | 0.63
KSBs+NN (i)  | 3          | 1     | (6 ± 7) × 10^-4   | (3 ± 2) × 10^-4       | (-4 ± 3) × 10^-2 | (-2.4 ± 0.8) × 10^-2   | 7 ± 2              | 9.7 ± 0.9              | 45 ± 9     | 0.31  | 0.79
KSBs+NN (i)  | 3          | 2     | (-4 ± 5) × 10^-4  | (-2 ± 2) × 10^-4      | (-4 ± 3) × 10^-2 | (-4 ± 1) × 10^-2       | 12 ± 3             | 9.3 ± 0.8              | 41 ± 8     | 0.32  | 0.88

a cf. Table 1 for a description of the individual samples; sample 1 is the largest one with medium signal-to-noise and inhomogeneous galaxy properties and thus the most
realistic one

b additive and multiplicative bias for the individual component, as fitted in section 4.6

c Q_6 is the quality parameter as used by GREAT08, averaging residuals over 'six-packs' of 60000 galaxies from the complete sample, Q_1 averages residuals over single sets
of 10000 galaxies only; cf. eqn. 12. Q_6/Q_1 ≈ 6 for unbiased methods, Q_6/Q_1 ≈ 1 where bias strongly dominates the noise at this sample size; cf. section 4.7

d estimates of the rms of individual galaxy scatter σ and bias b, averaging over the whole sample

e quantity measured for KSBs+NN on the blind data only

f quantity measured on the complete sample, using training and blind data for KSBs+NN

g quantity measured on the complete sample for KSBs and KSBs aff, on training data only for KSBs+NN

h these are KSBs outputs after an affine transformation with the m_i and c_i for the respective sample, such that the multiplicative and additive bias as fitted in section 4.6
disappear; cf. eqn. 16 and section 4.7

i these are the neural network corrected KSBs results


Table 3

Performance and bias of different shear estimation methods on circularized samples

method       sample^a  comp.  c^b                                               m^b                                               Q_1^c                     Q_6^c         σ^d    b × 10^3 ^d
                              blind^e                  all^f                    blind^e                  all^f                    blind^e      all/train^g
KSBs         1c        1      ...                      (-2.8 ± 0.4) × 10^-4     ...                      (0 ± 2) × 10^-3          ...          29.3 ± 0.9   84 ± 4        0.16   0.87
                       2      ...                      (-1.68 ± 0.04) × 10^-3   ...                      (1.0 ± 0.2) × 10^-2      ...          15.9 ± 0.5   24.0 ± 1.3    0.16   1.93
KSBs aff.^h  1c        1      ...                      (0 ± 4) × 10^-5          ...                      (0 ± 2) × 10^-3          ...          30.0 ± 0.9   89 ± 5        0.16   0.82
                       2      ...                      (0 ± 4) × 10^-5          ...                      (0 ± 2) × 10^-3          ...          29.5 ± 0.9   77 ± 5        0.16   0.95
KSBs+NN^i    1c        1      (-8 ± 5) × 10^-5         (-3 ± 4) × 10^-5         (-5 ± 2) × 10^-3         (-5.2 ± 1.5) × 10^-3     37.4 ± 1.5   40 ± 2       200 ± 20^j    0.16   0.22
                       2      (-1.4 ± 0.4) × 10^-4     (-7 ± 3) × 10^-5         (-1.6 ± 0.3) × 10^-2     (-1.3 ± 0.2) × 10^-2     41 ± 2       46 ± 2       180 ± 20^j    0.15   0.44
KSBs         2c        1      ...                      (9 ± 5) × 10^-5          ...                      (2.5 ± 0.2) × 10^-2      ...          100 ± 8      180 ± 40      0.07   0.67
                       2      ...                      (-1.94 ± 0.04) × 10^-3   ...                      (3.6 ± 0.2) × 10^-2      ...          21 ± 2       24 ± 5        0.07   2.03
KSBs aff.^h  2c        1      ...                      (0 ± 5) × 10^-5          ...                      (0 ± 2) × 10^-3          ...          172 ± 14     670 ± 130     0.07   0.25
                       2      ...                      (0 ± 4) × 10^-5          ...                      (0 ± 2) × 10^-3          ...          210 ± 20     1300 ± 300    0.07   0.00
KSBs+NN^i    2c        1      (0.3 ± 1.2) × 10^-4      (0 ± 4) × 10^-5          (2 ± 5) × 10^-3          (-1 ± 2) × 10^-3         230 ± 70     230 ± 20     1000 ± 200    0.06   0.18
                       2      (-1 ± 2) × 10^-4         (-2 ± 4) × 10^-5         (-6 ± 8) × 10^-3         (-2 ± 2) × 10^-3         160 ± 50     260 ± 20     1700 ± 300    0.06   0.00
KSBs         3c        1      ...                      (2.8 ± 0.2) × 10^-3      ...                      (-1.35 ± 0.07) × 10^-1   ...          3.2 ± 0.3    4.1 ± 0.8     0.29   4.81
                       2      ...                      (-8 ± 2) × 10^-4         ...                      (-1.08 ± 0.09) × 10^-1   ...          7.5 ± 0.6    16 ± 3        0.29   2.24
KSBs aff.^h  3c        1      ...                      (0 ± 2) × 10^-4          ...                      (0 ± 7) × 10^-3          ...          8.5 ± 0.7    40 ± 8        0.33   0.82
                       2      ...                      (0 ± 2) × 10^-4          ...                      (0 ± 9) × 10^-3          ...          9.2 ± 0.7    47 ± 9        0.32   0.61
KSBs+NN^i    3c        1      (5 ± 6) × 10^-4          (2 ± 2) × 10^-4          (-9 ± 3) × 10^-2         (-3 ± 1) × 10^-2         7 ± 2        11.0 ± 0.9   45 ± 9        0.29   0.91
                       2      (-3 ± 5) × 10^-4         (-2 ± 2) × 10^-4         (1 ± 3) × 10^-2          (-2 ± 1) × 10^-2         13 ± 4       12.0 ± 0.9   55 ± 11       0.28   0.69


^a cf. Table [1] for a description of the individual samples; sample 1c is the largest one with medium signal-to-noise, inhomogeneous galaxy properties and different PSFs and thus the
most realistic one

^b additive and multiplicative bias for the individual component, as fitted in section 4.6

^c $Q_6$ is the quality parameter as used by GREAT08, averaging residuals over 'six-packs' of 60000 galaxies from the complete sample, $Q_1$ averages residuals over single sets of 10000
galaxies only; cf. eqn. 12. $Q_6/Q_1 \approx 6$ for unbiased methods, $Q_6/Q_1 \approx 1$ where bias strongly dominates the noise at this sample size; cf. section 4.7

^d estimates of the rms of individual galaxy scatter σ and bias b, averaging over the whole sample

^e quantity measured for KSBs+NN on the blind data only

^f quantity measured on the complete sample, using training and blind data for KSBs+NN

^g quantity measured on the complete sample for KSBs and KSBs aff., on training data only for KSBs+NN

^h these are KSBs outputs after an affine transformation with the $m_i$ and $c_i$ for the respective sample, such that the multiplicative and additive bias as fitted in section 4.6 disappear;
cf. eqn. 16 and section 4.7

^i these are the neural network corrected KSBs results

^j $Q_6$ as found on the blind data only in the case of sample 1c



[Figure 3, panels: (a) sample 1/1c (medium signal to noise); (b) sample 2/2c (high signal to noise)]

Fig. 3. - Residual shear versus true shear for the neural network estimate KSBs+NN on different data sets
with linear fits for the additive and multiplicative shear measurement bias. Left panels: non-circularized
data; right panels: circularized data. Component 1 is plotted throughout; component 2 of the shear is very
similar. Each data point corresponds to the shear estimate for one set of 10000 galaxies. Black points and
solid lines correspond to blind data, grey points and dashed lines to all data.

[Figure 4, panels: (b) additive bias]

Fig. 4. - Multiplicative and additive bias as a function of signal-to-noise ratio for methods participating in
the GREAT08 competition (see legend and cf. Bridle et al. (2010)) and neural network estimates on the blind
samples with (dark grey) and without (light grey) circularization; the dashed black line is at m = c = 0,
which is also the result for KSB aff. as defined in section 4.7.

[Figure 5]

Fig. 5. - Q at different sample sizes.
Plotted are measured Q values for n =
10,000, 20,000, 30,000, 60,000 galaxies on component 1
of sample 1c for the neural network blind
data estimate (triangles / solid line) and the affine
transformation of KSB outputs with m = 0, c = 0
(squares / dashed line). The curves show the Q(n)
from eqn. [15] for the bias and scatter as found in
Table [3] for the two methods.

not expected to give different results than a test 
on purely blind data. 

We also compare the network results to the
results of KSBs output scaled with the parameters
$m_i$ and $c_i$ found for KSBs on each of the samples
in section 4.6. The result, defined as

$$ e_i^{\mathrm{KSB\,aff.}} = \frac{e_i^{\mathrm{iso}}/0.91 - c_i}{1 + m_i} , \tag{16} $$



is denoted as KSB aff. in Figure 0] and Tables [5] 
and [3]. By definition, KSB aff. has $m_i = c_i = 0$
on any of the samples. Note that the signal-to-noise
dependence of the bias is likely the strongest
influence on bias of all the GREAT08 parameters
(cf. Figures C3 and C4 in Bridle et al. 2010).
A correction of this dependence by splitting into
different signal-to-noise subsamples with different
affine scalings, as has been done in the case of
$e^{\mathrm{KSB\,aff.}}$, is therefore the most promising method
if one additional parameter is taken into account
for bias correction. The dependence of the bias on
the signal-to-noise ratio has in fact been taken into
account empirically in Schrabback et al. (2010).
The fact that the neural network estimate outperforms
$e^{\mathrm{KSB\,aff.}}$ on the inhomogeneous samples 1
and 1c shows that the networks successfully take
other dependences of the bias into account. For
direct comparison, the quality parameter Q for the
network estimate and KSB aff., as found for
component 1 at different set sizes n on the
circularized sample 1c, is shown in Figure [5].
Drawn are the measured Q at sample sizes
of n = 10000, 20000, 30000, 60000. Note that this
corresponds very well with the curve for eqn. [15]
drawn in terms of b and σ as found in Table [3].
The larger residual bias of KSB aff. on the subsets
of the inhomogeneous sample leads to a much
smaller asymptotic Q than for the neural network
estimate, which is projected to reach Q > 1000 at
sufficiently large sample sizes.
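
The dependence of Q on the set size n can be illustrated with a short Python sketch. It assumes that the curve referred to above has the common form Q(n) = 10^-4 / (b^2 + σ^2/n), i.e. squared residual bias plus the variance of the mean of n galaxies; this form and the function name q_of_n are assumptions made for illustration, while the values of b and σ are the Table 3 entries for component 1 of sample 1c.

```python
def q_of_n(bias, scatter, n):
    """Expected quality parameter for sets of n galaxies.

    Assumed model (not quoted from the paper): Q(n) = 1e-4 / (b**2 + sigma**2 / n),
    i.e. squared bias plus the variance of the mean of n galaxies.
    """
    return 1e-4 / (bias**2 + scatter**2 / n)


# b and sigma for component 1 of sample 1c, read off Table 3 (b is tabulated as b x 10^3)
methods = {"KSBs+NN": (0.22e-3, 0.16), "KSBs aff.": (0.82e-3, 0.16)}

for name, (b, sigma) in methods.items():
    q_values = [q_of_n(b, sigma, n) for n in (10_000, 20_000, 30_000, 60_000)]
    q_asymptotic = 1e-4 / b**2  # limit n -> infinity, where only the bias remains
    print(name, [round(q, 1) for q in q_values], round(q_asymptotic))
```

Under this assumed model the network estimate levels off at roughly Q ≈ 2000, whereas KSB aff. saturates at roughly Q ≈ 150, in line with the statement that only the former is projected to exceed Q = 1000.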

Complete results for plain KSBs, KSB aff. and
the output of the best networks are shown in Tables [2]
and [3]. The composition of squared errors
is plotted in Figure [6], while the GREAT08 quality
parameter $Q_6$ as a function of signal-to-noise ratio
is shown in Figure [7]. The bias is predominant in
plain KSBs outputs, especially for the first
component. This is particularly harmful as sample size
and signal strength increase. Despite the lowered
statistical uncertainty, the strong bias leads to
only slight overall improvements.

Affine transformations according to a fit to the 
known true shears greatly reduce the bias on the 
homogeneous subsamples 2, 2c, 3 and 3c. This is 
not surprising, as data sets here only differ by the 
shear applied to them and the change in bias due 
to this can be accounted for. On samples with
inhomogeneous galaxy properties such as samples 1
and 1c, however, bias cannot be removed in this way
and remains significant. Also, the remaining scatter
due to noise and differences in bias depending
on the individual galaxy properties within the
sample cannot be decreased by the affine transfor- 
mation. 
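
The affine correction discussed above can be sketched in Python as a straight-line fit of the residual shear against the true shear, followed by the inverse transformation. This is a minimal sketch on toy numbers: the bias model g_est - g_true = m g_true + c, the function names and all simulated values are assumptions made for illustration and are not taken from the paper's pipeline or data.

```python
import numpy as np


def fit_linear_bias(g_true, g_est):
    """Least-squares fit of g_est - g_true = m * g_true + c; returns (m, c)."""
    m, c = np.polyfit(g_true, g_est - g_true, 1)
    return m, c


def affine_correct(g_est, m, c):
    """Invert the fitted linear bias: subtract c, then divide by (1 + m)."""
    return (g_est - c) / (1.0 + m)


rng = np.random.default_rng(0)
g_true = rng.uniform(-0.04, 0.04, 300)  # one true shear per set (toy values)
# toy estimates with m = 0.03, c = -2e-3 and some per-set noise
g_est = 1.03 * g_true - 2e-3 + rng.normal(0.0, 1e-3, g_true.size)

m, c = fit_linear_bias(g_true, g_est)
g_aff = affine_correct(g_est, m, c)
m_after, c_after = fit_linear_bias(g_true, g_aff)
print(m, c, m_after, c_after)
```

By construction the refitted m and c vanish on the sample used for the fit. The scatter of the individual sets around the line is only rescaled by 1/(1 + m), which is why such a correction cannot remove bias that varies with galaxy properties inside an inhomogeneous sample.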

Neural networks, in contrast, greatly reduce
the bias in any of the samples such that it
does not dominate the statistical errors at this
sample size. This works remarkably well even on
the inhomogeneous samples. The observed reduction
in scatter indicates that bias depending on
the individual galaxy properties has been
successfully reduced as well. For a bias reduction scheme
to be applied to real data with diverse properties,
this is of crucial importance. As demonstrated
with the neural networks, a linear bias close to
zero on inhomogeneous samples should therefore
merely arise as a side effect of a proper overall
calibration, which can be validated using other
types of analyses as well.

[Figure 6, panels: (a) non-circularized samples; (b) circularized samples; horizontal axis: SNR]

Fig. 6. - Composition of mean squared errors
(red: bias, blue: scatter) of the different methods
(plain KSBs: KSB; KSBs after affine transformation
fitted to set multiplicative and additive bias
to zero: aff.; KSB with neural network corrections:
NN with total error as found for the blind sets) on
samples with different signal-to-noise ratio. The
middle block contains the medium signal-to-noise
samples 1/1c with inhomogeneous galaxy properties
most similar to real data.

[Figure 7, panels: (a) non-circularized samples; (b) circularized samples]

Fig. 7. - $Q_6$ as a function of signal-to-noise ratio
for the neural network estimate (solid line,
triangles), KSB aff. (dashed line, squares) and
plain KSB (dotted line, circles) plotted for
non-circularized (upper panel) and circularized
(lower panel) samples.

4.8. Circularization 

In order to compare the effects of PSF circularization
to traditional $P^{\mathrm{sm}}$ anisotropy correction,
we compare the results of samples 1, 2 and 3 to
samples 1c, 2c and 3c. While these are similar in
terms of signal-to-noise levels and galaxy
properties,^9 they differ in the method used for anisotropy
correction.

From the individual measurement scatter σ calculated
in Tables [2] and [3] we find that scatter can be
reduced by circularization, especially for the
noisier data sets. This can also be seen from
comparing the left and right panels of Figure [3] and
is not surprising, as the smear responsivity tensor
$P^{\mathrm{sm}}$ of the individual galaxy otherwise used for
anisotropy correction is certainly influenced by the
noise. The additional term $P^{\mathrm{sm}} p$ in eqn. [4] adds
to the scatter. Because p = 0 for a circular PSF,
this is not the case in the circularized samples.

Circularization, however, appears to add to or
at least change the bias present in the KSBs
output. On the medium to low signal-to-noise
data, calibration by the neural networks is successful
such that the overall result can be improved.
In contrast, a fitted affine transformation of
KSBs output on sample 1c is still strongly biased.

For sufficiently noisy data, circularization can
thus successfully be used as a method of reducing
scatter. Plain KSBs outputs, however, even after
traditional corrections, do not greatly profit from
this as the residual bias dominates here. In order
to benefit from the reduced scatter in circularized
samples one has to combine circularization with a
means of reducing bias, as has been done with the
neural networks in this work.



^9 Sample 1c contains 600 additional galaxies not in sample
1, which are of fiducial properties but convolved with PSF
2 or 3.



4.9. Analyzing network output 

We continue to analyze the single galaxy network
output as a function of KSB $e^{\mathrm{iso}}$ for different
subsamples of GREAT08.

The network output for shear component j is a
nonlinear function $f_j(v^i)$ of the input vector $v^i$ with
KSB shear estimates and additional parameters
of each galaxy i. One may interpret this, although
not unambiguously, as an additive bias correction
$-c_j(v^i)$ and a weighting and multiplicative bias
correction $w_j(v^i)$,

$$ w_j(v^i) := \left(1 + m_j(v^i)\right)^{-1} \frac{\langle \sigma_j^2 \rangle}{\sigma_j^2(v^i)} , \tag{17} $$

where $\langle \sigma_j^2 \rangle$ is the average variance of the
measurements from the sample galaxy i has been taken
from, $\sigma_j^2(v^i)$ the variance of the measurement of
the particular galaxy and $m_j(v^i)$ a galaxy set
property dependent multiplicative bias. The network
output can then be written as

$$ f_j(v^i) = w_j(v^i) \cdot \left(e_j^{\mathrm{iso},i} - c_j(v^i)\right) . \tag{18} $$

Consider now a version of the galaxy rotated
by 90° so as to take the true ellipticity $e_j$ to
its negative $e'_j = -e_j$ and $v^i \to v'^i$. As the
variance $\sigma_j^2(v^i)$ and multiplicative bias $m_j(v^i)$
should remain constant under such a transformation,
an ideal network should use an unchanged
$w_j(v^i) = w_j(v'^i)$. The additive bias correction
should be taken such that

$$ e_j^{\mathrm{iso},i'} - c_j(v'^i) = -\left(e_j^{\mathrm{iso},i} - c_j(v^i)\right) . $$

This results in a point symmetric distribution of
neural network outputs as a function of $e_j^{\mathrm{iso}}$ with
respect to a zero shifted by the additive bias.
Differences in $c_j$ for different galaxies will cause
asymmetries in the distributions, but
for $|c_j| \ll 1$ these will only be slight. We therefore
expect a non-overfitted network to give an almost
point symmetric output distribution.

For $e^{\mathrm{iso}}$ bins we find percentiles of the network
output corresponding to the median and 1-3σ
in the case of a normal distribution. Plots
of the resulting percentile curves of the networks
trained on circularized data are shown in Figure [8].
They show point symmetry, which is additional
evidence that the networks we use are not overfitted.
The general shape can be interpreted as a
high weighting of relatively circular galaxies and a
down-weighting of more elliptical galaxies. Slight
differences can also be seen, for instance, between
the output for the subsamples of sample 1c with
larger and smaller than fiducial galaxy FWHM on
the left and right of the lower panel, respectively.
None of the network corrections can be interpreted
as a single affine transformation.

[Figure 8, panels: (a) sample 1c (medium signal to noise), both components; (b) sample 2c (high) and 3c (low signal to noise); (c) branches from 1c with large and small galaxy FWHM; horizontal axes: KSB $e_1$, KSB $e_2$]

Fig. 8. - Percentiles (50th corresponding to the median, 14th and 86th (1σ), 2.5th and 97.5th (2σ), 0.2th
and 99.8th (3σ)) of single galaxy network output as a function of KSB $e^{\mathrm{iso}}$ for circularized samples. The red
line is the identity. Both components are plotted for the overall sample 1c in the top panel, component 1 is
plotted for non-fiducial signal-to-noise sets 2c and 3c in the middle panel and for non-fiducial galaxy FWHM
in the lower panel.
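
The percentile curves and the point symmetry test described above can be sketched in Python as follows. This is an illustrative sketch only: the function names, the binning and the toy network response are assumptions, and the percentile levels are those quoted in the caption of Fig. 8.

```python
import numpy as np

PERCENTILES = (0.2, 2.5, 14, 50, 86, 97.5, 99.8)  # levels quoted in the Fig. 8 caption


def percentile_curves(e_iso, output, edges):
    """Percentiles of `output` in bins of `e_iso`; rows = bins, columns = PERCENTILES."""
    idx = np.digitize(e_iso, edges) - 1
    rows = []
    for b in range(len(edges) - 1):
        sel = output[idx == b]
        if sel.size:
            rows.append(np.percentile(sel, PERCENTILES))
        else:
            rows.append(np.full(len(PERCENTILES), np.nan))
    return np.array(rows)


def point_asymmetry(curves):
    """Largest deviation from point symmetry about the origin.

    Requires bin edges symmetric about zero: the p-th percentile in bin b is
    compared with minus the (100 - p)-th percentile in the mirrored bin.
    """
    return np.nanmax(np.abs(curves + curves[::-1, ::-1]))


rng = np.random.default_rng(0)
e_iso = rng.uniform(-1.0, 1.0, 20_000)
# hypothetical network response: weight falling with ellipticity, no additive bias
output = e_iso / (1.0 + 4.0 * e_iso**2) + rng.normal(0.0, 0.05, e_iso.size)
edges = np.linspace(-1.0, 1.0, 11)
curves = percentile_curves(e_iso, output, edges)
print(curves.shape, point_asymmetry(curves))
```

For a finite sample the asymmetry measure is dominated by sampling noise in the extreme percentiles, so it is best compared against the value obtained from an explicitly mirrored copy of the same sample.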

The network output, therefore, must not be in- 
terpreted as the true ellipticity of the galaxies. 
It is merely a quantity that, if averaged arith- 
metically, gives a good ensemble shear estimate. 
For different applications, such as shear correla- 
tion measurements, different network inputs and 
cost functions can be used. Apart from defining a 
figure of merit and constructing the cost function 
such that the figure of merit is being maximized 
during training, one may in these cases make use 
of the rotational and permutational invariance of 
the expected output. 

5. Conclusion 

We have presented a scheme of neural networks 
which is capable of reducing bias in KSB shear 
measurements to a level where it no longer inhibits 
the success of future surveys. Bias correction was 
most successful on the medium to high signal-to- 
noise data sets. This result might give hints as to 
the most promising setup of future pipelines. 

We showed that circularization of the PSF re- 
duces the scatter as compared to PSF anisotropy 
correction based on a single galaxy $P^{\mathrm{sm}} p$ term.
Therefore, circularization of varying PSFs in com- 
bination with neural networks seems to be very 
promising for shear measurement on real data. 

Overall results in terms of shear measurement 
accuracy are very encouraging. By means of neu- 
ral networks, it was possible to calibrate tradi- 
tional KSB shape measurement to an accuracy 
competitive with the most successful methods and 
well above traditionally calibrated shape measure- 
ment approaches participating in the GREAT08 
challenge. On real data KSB remains the method 
most commonly used, which makes this improve- 
ment extremely valuable. The neural network 
scheme presented in this paper is, however, also a 
general approach. It can be applied to any other 
shear measurement pipeline that is using single 
galaxy parameters to find true shears by an aver- 
aging procedure. We expect that neural networks 



are able to reduce bias in these methods as well. 

The success of any shear measurement calibra- 
tion scheme, including neural networks, depends 
on the availability of data with known shear sim- 
ilar to the data to be analyzed. In the case of 
the GREAT08 challenge, this was available from 
the simulations themselves. For the application 
on real data, it is necessary to simulate training 
data sets with known shear values from and sim- 
ilar to the real data. This can be done by either 
fitting galaxy models to the objects to be analyzed 
and simulating sheared data from the fitted profiles
or by applying a finite-resolution shear operator
(Kaiser 2000) to the original image data itself.

Dark Universe", the DFG Cluster of Excellence on
the "Origin and Structure of the Universe" and the
RTN-Network "DUEL" (Dark Universe through
Extragalactic Lensing) gravitational lensing. We
thank the Bonn lensing group for an introduction
to KSB and for providing us with some of their
scripts. We are also grateful to A. Collister and
O. Lahav for making available their neural network
implementation ANNz which the scheme presented
has been built upon.

A. Appendix

A.1. Back-Propagation of Errors

The learning procedure of a neural network consists in finding a set of weights $w_{ij}$ for the connections between
nodes of adjacent layers such that some cost function $E$ of the network output becomes minimal. In
principle, it is possible to achieve this using gradient descent, i.e. changing the weights in each step according
to

$$ \Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}} , \tag{A1} $$

with $\eta$ being a small, positive parameter. The updating of weights can be done on the basis of individual
data sets or after a batch of data sets has been processed. However, an efficient method of calculating the
required derivative $\partial E / \partial w_{ij}$ needs to be found.

For the output layer the desired output of the nodes and thus the required change of the weights is
obvious. Rumelhart et al. (1986) presented an algorithm based on the back-propagation of errors through
the network which makes it possible to find $\partial E / \partial w_{ij}$ efficiently for all weights $w_{ij}$ in a multi-layer Perceptron.
We are going to introduce this algorithm here, as it is also the foundation of the neural networks we use.
Using the chain rule we can rewrite the derivative

$$ \frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial a_i} \frac{\partial a_i}{\partial w_{ij}} = \frac{\partial E}{\partial a_i} x_j \tag{A2} $$

in terms of derivatives with respect to nodes' inputs $a_i$ and their outputs or activations $x_j$.

Since the activations $x_j$ for any node can easily be found by feeding the respective data set to the network,
the remaining task consists in calculating for each node $i$ the quantity $\delta_i$, often referred to as error,

$$ \delta_i := \frac{\partial E}{\partial a_i} . \tag{A3} $$

For the output layer of nodes with activation functions $f(a_i^{\mathrm{out}}) =: x_i^{\mathrm{out}}$ the derivative can be written as

$$ \delta_i^{\mathrm{out}} = \frac{\partial E}{\partial a_i^{\mathrm{out}}} = \frac{\partial E}{\partial x_i^{\mathrm{out}}} \frac{\partial x_i^{\mathrm{out}}}{\partial a_i^{\mathrm{out}}} = \frac{\partial E}{\partial x_i^{\mathrm{out}}} f'(a_i^{\mathrm{out}}) . \tag{A4} $$

In the common case of identity activation functions for the output layer and squared error cost functions
$E = \frac{1}{2} \sum_i (x_i^{\mathrm{out}} - \hat{x}_i)^2$ with true results $\hat{x}_i$ for single training sets this leads to the simple residuals of the
output nodes' activations,

$$ \delta_i^{\mathrm{out}} = x_i^{\mathrm{out}} - \hat{x}_i . \tag{A5} $$

For a node $j$ from any other layer we can simplify the expression for $\delta_j$ using a sum over the nodes $k$ of
the following layer, requiring that the activation functions $f_j$ be differentiable:

$$ \delta_j = \frac{\partial E}{\partial a_j} = \sum_k \frac{\partial E}{\partial a_k} \frac{\partial a_k}{\partial a_j} = \sum_k \frac{\partial E}{\partial a_k} \frac{\partial a_k}{\partial x_j} \frac{\partial x_j}{\partial a_j} = f_j'(a_j) \sum_k \delta_k w_{kj} . \tag{A6} $$

Thus with eqn. (A6) we can calculate one layer's $\delta$s using the derivative of its activation functions, the
subsequent layer's $\delta$s and the weights of the connections between the two layers. With the output layer's $\delta$s given
in eqn. (A4), we can back-propagate the errors through the network to find $\delta_i$ for each node. Finally, we
can change the weights accordingly (using equations (A1), (A2) and (A3)):

$$ \Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}} = -\eta \frac{\partial E}{\partial a_i} x_j = -\eta \delta_i x_j . \tag{A7} $$

Applying these changes with a small enough parameter $\eta > 0$, it is possible to gradually minimize the
cost function and therefore improve the network performance. Also, the gradient found can be used with a
Quasi-Newton algorithm.
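
The back-propagation steps (A2) to (A7) can be sketched in Python for a small multi-layer Perceptron. This is a minimal illustration, not the ANNz implementation: tanh hidden units, an identity output layer, the omission of bias nodes, the layer sizes and all names are assumptions made for the example.

```python
import numpy as np


def forward(weights, x):
    """Return the activations of all layers; tanh hidden layers, identity output layer."""
    acts = [x]
    for k, w in enumerate(weights):
        a = w @ acts[-1]  # node inputs a_i = sum_j w_ij x_j
        acts.append(a if k == len(weights) - 1 else np.tanh(a))
    return acts


def cost(weights, x, target):
    """Squared error cost E = 0.5 * sum_i (x_i_out - target_i)**2."""
    return 0.5 * np.sum((forward(weights, x)[-1] - target) ** 2)


def backprop(weights, x, target):
    """Gradients dE/dw_ij for every layer, following (A2) to (A6)."""
    acts = forward(weights, x)
    delta = acts[-1] - target  # (A5): output errors are the residuals
    grads = [None] * len(weights)
    for k in reversed(range(len(weights))):
        grads[k] = np.outer(delta, acts[k])  # (A2): dE/dw_ij = delta_i * x_j
        if k > 0:
            # (A6): delta_j = f'(a_j) * sum_k delta_k w_kj, with tanh' = 1 - x_j**2
            delta = (1.0 - acts[k] ** 2) * (weights[k].T @ delta)
    return grads


rng = np.random.default_rng(1)
weights = [rng.normal(0.0, 0.5, (4, 3)), rng.normal(0.0, 0.5, (2, 4))]
x, target = rng.normal(size=3), np.array([0.1, -0.2])
eta = 0.05  # small positive step parameter of (A1)

before = cost(weights, x, target)
for _ in range(200):
    grads = backprop(weights, x, target)
    weights = [w - eta * g for w, g in zip(weights, grads)]  # (A7)
after = cost(weights, x, target)
print(before, after)
```

Comparing the returned gradients with finite differences of the cost is a simple way to check such an implementation before using it with a Quasi-Newton optimizer.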
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A. 2. Back-Propagation in Averaging Perceptrons 

For averaging Perceptrons we define the cost function as the sum of squared errors of the output averages
$\langle x_i^{\mathrm{out}} \rangle$ of a meta set of single data sets against known true average outputs $\hat{x}_i$ for the meta data set:

$$ E = \frac{1}{2} \sum_i \left( \langle x_i^{\mathrm{out}} \rangle - \hat{x}_i \right)^2 . $$

We define errors $\delta$ for each node, this time referring to a single data set $m$ with unknown true result
which is part of a meta data set with known average true result. Recalling definition (A3) and the form of
arithmetic averages we find

$$ \delta_j^m = \frac{\partial E}{\partial a_j^m} , $$

which simplifies for the output nodes to a constant error for all single data sets,

$$ \delta_i^{\mathrm{out},m} = \frac{1}{M} \left( \langle x_i^{\mathrm{out}} \rangle - \hat{x}_i \right) \quad \forall m , $$

with $M$ the number of single data sets in the meta data set,
thus $\sum_m \delta_i^{\mathrm{out},m} = \langle x_i^{\mathrm{out}} \rangle - \hat{x}_i = \delta_i^{\mathrm{out}}$ as defined in eqn. (A5) above.

For hidden nodes we find, in analogy to eqn. (A6), summing over nodes $k$ of the subsequent layer,

$$ \delta_j^m = f_j'(a_j^m) \sum_k \delta_k^m w_{kj} , $$

which depends on $m$ due to the different activations and therefore activation derivatives of the nodes for
each primitive data set. Note that $\sum_m \delta_j^m$ is no longer equal to $\delta_j$ as defined in eqn. (A6), as $f_j'(a_j^m)$ and $\delta_k^m$
are not necessarily independent.

Calculating the derivative of the cost function with respect to each weight in turn we find the quantity
needed for gradient descent:

$$ \frac{\partial E}{\partial w_{ij}} = \sum_m \frac{\partial E}{\partial a_i^m} \frac{\partial a_i^m}{\partial w_{ij}} = \sum_m \delta_i^m x_j^m . $$

An algorithm following this calculation therefore first has to find network averages for the meta data set
under consideration. After having calculated the constant $\delta^{\mathrm{out}}$ from that, it has to feed each primitive data
set $m$ in turn, back-propagate $\delta^m$ through the network and sum $\delta_i^m x_j^m$ for each weight $w_{ij}$.
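
The algorithm for averaging Perceptrons can be sketched in Python in vectorized form, processing all single data sets of one meta data set at once. As before this is an illustrative sketch under assumptions (tanh hidden units, identity output layer, no bias nodes, hypothetical names and toy data), not the implementation used for the results in this paper.

```python
import numpy as np


def forward(weights, X):
    """Activations for a meta data set X (rows = single data sets m)."""
    acts = [X]
    for k, w in enumerate(weights):
        a = acts[-1] @ w.T
        acts.append(a if k == len(weights) - 1 else np.tanh(a))
    return acts


def averaging_cost(weights, X, target_mean):
    """E = 0.5 * sum_i (<x_i_out> - target_i)**2 with the average taken over m."""
    return 0.5 * np.sum((forward(weights, X)[-1].mean(axis=0) - target_mean) ** 2)


def averaging_backprop(weights, X, target_mean):
    """dE/dw_ij = sum_m delta_i^m x_j^m for the cost on output averages."""
    acts = forward(weights, X)
    n_sets = X.shape[0]
    # constant output error, identical for every single data set m
    delta = np.tile((acts[-1].mean(axis=0) - target_mean) / n_sets, (n_sets, 1))
    grads = [None] * len(weights)
    for k in reversed(range(len(weights))):
        grads[k] = delta.T @ acts[k]  # sum over m of delta_i^m x_j^m
        if k > 0:
            # hidden errors depend on m through the activation derivatives
            delta = (1.0 - acts[k] ** 2) * (delta @ weights[k])
    return grads


rng = np.random.default_rng(3)
weights = [rng.normal(0.0, 0.5, (4, 3)), rng.normal(0.0, 0.5, (1, 4))]
X = rng.normal(size=(50, 3))  # 50 single data sets with 3 inputs each (toy data)
target_mean = np.array([0.02])  # known true average output of the meta data set
eta = 0.05

before = averaging_cost(weights, X, target_mean)
for _ in range(200):
    grads = averaging_backprop(weights, X, target_mean)
    weights = [w - eta * g for w, g in zip(weights, grads)]
after = averaging_cost(weights, X, target_mean)
print(before, after)
```

Only the average output is constrained here, so the individual outputs remain free to differ from set to set, which is the property that distinguishes this cost function from the one in eqn. (A5).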
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